ABSTRACT

Although dense local spatio-temporal features with a
bag-of-features representation achieve state-of-the-art
performance for action recognition, the huge number and size
of the features prevent current methods from scaling up to
real-size problems. In this work, we investigate different
types of feature sampling strategies for action recognition,
namely dense sampling, uniformly random sampling, and
selective sampling. We propose two effective selective
sampling methods using object proposal techniques.
Experiments conducted on a large video dataset show that one
of the proposed selective sampling methods achieves better
average recognition accuracy using 25% fewer features, and
retains comparable accuracy while discarding 70% of the
features.

Index Terms —Action recognition, Video analysis, Feature 
sampling 

1. INTRODUCTION 

Given the popularity of social media, it has become much
easier to collect a large number of videos from the Internet
for human action recognition. Effective video representation
is required for recognizing human actions and understanding
video content in such rapidly growing unstructured data.

So far, the most commonly used video representation for
action recognition has been the bag-of-words (BoW) model. The
basic idea is to summarize (encode) the local spatio-temporal
features of a video as a single vector. Among local features,
dense trajectory (DT) [2] and its improved variant (iDT) [3]
provide state-of-the-art results on most action datasets. The
main idea is to construct trajectories by tracking densely
sampled feature points across frames, and to compute multiple
descriptors along the trajectories.

Despite their success, DT and iDT can produce a huge number
of local features; e.g., for a low-resolution 320 x 204 video
with 175 frames, they can generate approximately 52 Mb of
features. It is difficult to store and manipulate such dense
features for large datasets with thousands of high-resolution
videos, especially in real-time applications.

Existing work focuses on reducing the total number of
trajectory features through uniformly random sampling, at the
cost of a minor reduction in recognition accuracy. One
approach proposed a part model with which features can be
randomly sampled efficiently at lower image scales. Another
interpolated trajectories using uniformly distributed nearby
feature points. A further study investigated the influence of
random sampling on recognition accuracy on several
large-scale datasets. However, intuitively, features
extracted around informative regions, such as the arms in
hand waving, should be more useful for action classification
than features extracted from the background. Selective
sampling strategies for dense trajectory features have also
been proposed, based on saliency maps produced by modeling
human eye movement when viewing videos. They achieve better
recognition results with selectively sampled features.
However, it is impractical to obtain eye movement data for
large datasets.

Fig. 1. Different feature sampling methods for action
recognition.

In this work, we investigate several feature sampling
strategies for action recognition, as illustrated in Fig. 1,
and propose two data-driven selective feature sampling
methods. Inspired by the success of applying object proposal
techniques to efficient saliency detection, we construct
saliency maps using a recent object proposal method, EdgeBox,
and selectively sample dense trajectory features for action
recognition. We further extend EdgeBox to produce proposals
and construct saliency maps for objects with motion of
interest, so that more effective features can be sampled for
action classification. We evaluate several feature sampling
methods on a publicly available dataset, and show that the
proposed motion object proposal based selective sampling
method achieves better accuracy using 25% fewer features than
the full feature set.

The remainder of this paper is organized as follows: we first
give a brief introduction to the DT/iDT features and the
other components of our action classification framework, and
then describe three different feature sampling methods.
Finally, we discuss experimental results on a large video
dataset.









Fig. 2. Illustration of selective sampling methods via object proposal algorithms. From left to right, the original video frame, 
dense optical flow field, estimated object boundaries, top 5 scoring boxes generated by EdgeBox, saliency map constructed using 
EdgeBox proposals, estimated motion boundaries, top 5 scoring boxes generated by FusionEdgeBox, saliency map constructed 
using FusionEdgeBox. 


2. DENSE TRAJECTORY FEATURES 

The DT algorithm [2] represents a video by dense
trajectories, together with appearance and motion features
extracted around the trajectories. On each video frame,
feature points are densely sampled on a grid with a spacing
of 5 pixels at 8 spatial scales spaced by a factor of
$1/\sqrt{2}$, as illustrated in the second column of Fig. 1.
Trajectories are then constructed by tracking the feature
points through the video based on dense optical flow. The
default length of a trajectory is 15, i.e., feature points
are tracked over 15 consecutive frames. The iDT algorithm [3]
further enhances trajectory construction by eliminating
background motion caused by camera movement.
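
The dense grid described above can be sketched as follows.
This is only an illustration under stated assumptions: the
function name, the rounding of the scaled frame size, and the
absence of any further point filtering are hypothetical
choices, not details taken from the DT implementation.

```python
import math

def dense_grid_points(width, height, spacing=5, n_scales=8):
    """Return {scale_index: [(x, y), ...]} of grid points per spatial scale.

    Assumption: each scale shrinks the frame by 1/sqrt(2) relative to
    the previous one, and points are placed every `spacing` pixels.
    """
    points = {}
    for s in range(n_scales):
        factor = (1.0 / math.sqrt(2.0)) ** s
        w, h = int(width * factor), int(height * factor)
        if w < spacing or h < spacing:
            break  # frame too small to hold a grid at this scale
        points[s] = [(x, y)
                     for y in range(0, h, spacing)
                     for x in range(0, w, spacing)]
    return points

grid = dense_grid_points(320, 204)
total = sum(len(p) for p in grid.values())
print(len(grid), len(grid[0]), total)
```

Even for a 320 x 204 frame, the first scale alone yields
several thousand candidate points, which is why the number of
resulting trajectories grows so quickly.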

For each trajectory, 5 types of descriptors are extracted:
1) the trajectory shape, which encodes local motion patterns
and is described by a sequence of displacement vectors in the
x- and y-directions; 2) HOG, the histogram of oriented
gradients, which captures appearance information and is
computed in a 32 x 32 x 15 spatio-temporal volume surrounding
the trajectory; 3) HOF, the histogram of optical flow, which
focuses on local motion information and is computed in the
same spatio-temporal volume as HOG; 4+5) MBHx and MBHy,
motion boundary histograms [14], which are computed
separately for the horizontal and vertical gradients of the
optical flow. HOG, HOF, and MBH are all normalized
appropriately.

To encode the descriptors, we use the Fisher vector, as in
prior work. For each feature type, we first reduce its
dimensionality by a factor of two using Principal Component
Analysis (PCA). A codebook of size 256 is then formed by
fitting a Gaussian Mixture Model (GMM) to a random selection
of 256,000 features from the training set. To combine
different types of features, we simply concatenate their
$\ell_2$-normalized Fisher vectors.
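
The final combination step can be sketched as below. The
sketch covers only the normalization and concatenation; the
PCA, GMM, and Fisher vector computation are assumed to have
been done elsewhere, and the function names and the `eps`
guard are hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm (eps guards against zero vectors)."""
    v = np.asarray(v, dtype=np.float64)
    return v / max(np.linalg.norm(v), eps)

def combine_fisher_vectors(fisher_vectors):
    """Concatenate per-descriptor Fisher vectors after l2-normalizing each one."""
    return np.concatenate([l2_normalize(v) for v in fisher_vectors])

# Toy stand-ins for the Fisher vectors of two descriptor types.
video_repr = combine_fisher_vectors([[3.0, 4.0], [0.0, 5.0]])
print(video_repr)
```

Normalizing each vector before concatenation keeps one
descriptor type from dominating the others purely because of
its scale.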

For classification, we apply a linear SVM provided by LIBSVM,
and a one-versus-rest approach is used for multi-class
classification. In all experiments, we fix C = 100 in the
SVM, as suggested in prior work.

3. FEATURE SAMPLING STRATEGIES 

In the following, we describe three feature sampling methods,
which differ from using all trajectories and related features
computed on dense grids as in the DT/iDT algorithms. All
three methods derive a sampling probability for each
trajectory feature, denoted by $\sigma$, which determines
whether the feature is sampled. For example, $\sigma = 0.8$
means that we sample trajectory features with probability
greater than or equal to 0.8 for action recognition.

3.1. Uniformly Random Sampling 

Following previous work, we simply sample dense trajectory
features in a random and uniform way. The sampling
probability, $\sigma$, is the same for each trajectory. In
experiments, we randomly sample 80%, 60%, 40%, and 30% of the
trajectory features, and report the corresponding action
recognition accuracies.
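
A minimal sketch of uniformly random sampling is given below.
It assumes the features of a video are held in a list and
that a fixed fraction is kept without replacement; the
function name and the fixed seed are hypothetical.

```python
import random

def uniform_random_sample(features, keep_rate, seed=0):
    """Keep a fixed fraction of features, chosen uniformly without replacement."""
    rng = random.Random(seed)
    n_keep = round(keep_rate * len(features))
    # Sort the chosen indices so the kept features retain their original order.
    idx = sorted(rng.sample(range(len(features)), n_keep))
    return [features[i] for i in idx]

features = list(range(1000))  # stand-in for 1000 trajectory features
for rate in (0.8, 0.6, 0.4, 0.3):
    print(rate, len(uniform_random_sample(features, rate)))
```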

3.2. Selective Sampling via Object Proposal 

EdgeBox is one of the efficient object proposal algorithms
published recently. We use it to construct a saliency map for
each video frame, and sample trajectory features according to
the computed saliency values.

In EdgeBox, given a video frame, object boundaries are
estimated via structured decision forests, and object
contours are formed by grouping detected boundaries with
similar orientations. To determine how likely a bounding box
is to contain an object of interest, a simple but effective
objectness score $s_{\text{obj}}$ was proposed, based on the
number of contours that are wholly enclosed by the box. We
allow at most 10,000 boxes of different sizes and aspect
ratios to be examined per frame. Fig. 2 illustrates the
estimated object boundaries and the top 5 scoring boxes
generated by EdgeBox in the third and fourth columns,
respectively.

Given thousands of object proposal boxes on a video frame, we
construct a saliency map through a pixel voting procedure.
Each object proposal box counts as one vote for every pixel
located inside it. We normalize the pixel votes to [0, 1] to
form a saliency probability distribution. Saliency map
examples are illustrated in the fifth column of Fig. 2.
Warmer colors indicate higher saliency probabilities.
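
The pixel voting procedure can be sketched as follows. The
box format (x, y, width, height) and the normalization by the
maximum vote count are assumptions made for illustration; the
text only states that votes are normalized to [0, 1].

```python
import numpy as np

def saliency_from_boxes(boxes, height, width):
    """Build a saliency map in [0, 1] by letting each box vote for its pixels.

    boxes: iterable of (x, y, w, h) in pixel coordinates (assumed format).
    """
    votes = np.zeros((height, width), dtype=np.float64)
    for x, y, w, h in boxes:
        votes[y:y + h, x:x + w] += 1.0  # one vote per enclosed pixel
    peak = votes.max()
    # Assumption: divide by the maximum so the most-voted pixel has value 1.
    return votes / peak if peak > 0 else votes

saliency = saliency_from_boxes([(0, 0, 4, 4), (2, 2, 4, 4)], height=6, width=6)
print(saliency)
```

Pixels covered by many overlapping proposals receive values
close to 1, while pixels covered by no proposal stay at 0.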

Based on the constructed saliency maps of a video, we
selectively sample trajectories and their related features.
If the saliency probability at the starting pixel of a
trajectory is higher than a predefined sampling probability
$\sigma$, the trajectory and its related features are
sampled. In experiments, we report recognition accuracies for
$\sigma$ = 0.2, 0.4, and 0.6.
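
The selection rule can be sketched as below. The trajectory
record layout (a dict with `start_frame` and `start_xy`) and
the per-frame dictionary of saliency maps are hypothetical
data structures chosen only for illustration.

```python
import numpy as np

def select_trajectories(trajectories, saliency_maps, sigma):
    """Keep trajectories whose starting pixel has saliency above sigma.

    trajectories: list of dicts with 'start_frame' and 'start_xy' (assumed layout).
    saliency_maps: dict mapping frame index to an (H, W) array in [0, 1].
    """
    kept = []
    for traj in trajectories:
        x, y = traj["start_xy"]
        sal = saliency_maps[traj["start_frame"]]
        if sal[int(y), int(x)] > sigma:  # arrays are indexed as [row, column]
            kept.append(traj)
    return kept

sal = np.zeros((4, 4))
sal[1, 2] = 0.9
trajs = [{"start_frame": 0, "start_xy": (2, 1)},
         {"start_frame": 0, "start_xy": (0, 0)}]
print(len(select_trajectories(trajs, {0: sal}, sigma=0.4)))
```

A larger $\sigma$ keeps fewer trajectories, so the three
thresholds correspond to three different sampling rates.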

3.3. Selective Sampling via Motion Object Proposal 

Although stacking the boxes generated by EdgeBox highlights
regions of a frame that contain salient objects, the
resulting saliency map may not be suitable for sampling
features for action recognition. For example, in the last row
of Fig. 2, the optical flow field (second column) clearly
indicates that the region with motion of interest for action
recognition is located around the actor's head and arms,
while the top scoring boxes and the saliency map constructed
via EdgeBox incorrectly focus on the actor's legs. Thus, in
order to incorporate motion information, we propose a motion
object proposal method, named FusionEdgeBox, in which a fused
objectness score is measured on both object boundaries and
motion boundaries.

The fusion score function is defined as 

$$ s_{\text{fusion}} = \alpha\, s_{\text{obj}} + \beta\, s_{\text{motion}} \tag{1} $$

where $s_{\text{obj}}$ is the original EdgeBox score,
$s_{\text{motion}}$ is the proposed motion objectness score,
and $\alpha$ and $\beta$ are balance parameters. We
empirically fix $\alpha = \beta = 1$ for all experiments.
$s_{\text{motion}}$ is defined similarly to $s_{\text{obj}}$,
i.e., based on the number of contours wholly enclosed in a
box. However, $s_{\text{motion}}$ uses contours that are
grouped from motion boundaries, which are estimated as image
gradients of the optical flow field. Motion boundary examples
are shown in the sixth column of Fig. 2.
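
The two ingredients of the fused score can be sketched as
follows. The sketch assumes the optical flow is given as an
(H, W, 2) array and measures motion boundary strength as the
gradient magnitude of both flow components; the contour
grouping and box scoring of EdgeBox are not reproduced, and
all function names are hypothetical.

```python
import numpy as np

def motion_boundary_strength(flow):
    """Gradient magnitude of an optical flow field of shape (H, W, 2)."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)  # derivatives along rows, then columns
    dv_dy, dv_dx = np.gradient(v)
    return np.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)

def fusion_score(s_obj, s_motion, alpha=1.0, beta=1.0):
    """Weighted sum of object and motion scores, as in Eq. (1)."""
    return alpha * s_obj + beta * s_motion

# Toy flow: the right half of the frame moves horizontally, the left is static.
flow = np.zeros((4, 6, 2))
flow[:, 3:, 0] = 1.0
print(motion_boundary_strength(flow)[0])
print(fusion_score(0.25, 0.5))
```

In the toy example the boundary strength is non-zero only
around the column where the motion changes, which is the kind
of edge that the motion score is meant to reward.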

By applying the fusion score within the EdgeBox framework, we
generate a set of proposal boxes and construct the saliency
map for feature sampling in the same way. Examples of the top
5 scoring fusion boxes and the constructed saliency maps are
illustrated in the last two columns of Fig. 2, respectively.
Compared with the examples generated by the original EdgeBox
(columns 3-5), FusionEdgeBox better captures regions with
motion of interest, which is useful for action feature
sampling (as verified by the experiments in Section 4).

Similarly, we report recognition accuracies using sampled
trajectory features for $\sigma$ = 0.2, 0.4, and 0.6.

4. EXPERIMENTS 

We conduct experiments on one publicly available video
dataset, J-HMDB [18], which consists of 920 videos of 21
different actions. These videos are selected from a larger
dataset, HMDB [19]. J-HMDB also provides annotated bounding
boxes for the actors in each frame. We report the average
classification accuracy over the three training/testing
splits provided by J-HMDB.

In the following, we evaluate action recognition on J-HMDB
using trajectory features sampled by the different methods,
and discuss their performance. We also compare the obtained
accuracies with a few state-of-the-art action recognition
algorithms.

4.1. Influence of Sampling Strategies 

In addition to the three feature sampling methods introduced
above, to better understand trajectory features, we
investigate a fourth sampling method that uses the annotated
bounding boxes of the actors. We sample a trajectory feature
if the starting point of the trajectory lies inside an
annotation box. A similar strategy was proposed in prior
work, and we refer to it as GT.
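
The GT sampling rule can be sketched as below. The box format
(x0, y0, x1, y1), the inclusive boundary test, and the
trajectory record layout are assumptions made for
illustration only.

```python
def inside_box(point, box):
    """True if point (x, y) lies inside box (x0, y0, x1, y1), borders included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def gt_sample(trajectories, boxes_per_frame):
    """Keep trajectories that start inside an annotated actor box.

    boxes_per_frame: dict mapping frame index to a list of boxes (assumed layout).
    """
    return [t for t in trajectories
            if any(inside_box(t["start_xy"], b)
                   for b in boxes_per_frame.get(t["start_frame"], []))]

trajs = [{"start_frame": 0, "start_xy": (50, 60)},
         {"start_frame": 0, "start_xy": (5, 5)},
         {"start_frame": 1, "start_xy": (50, 60)}]
boxes = {0: [(40, 40, 120, 200)]}
print(len(gt_sample(trajs, boxes)))
```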

Figures 3 and 4 plot the average classification accuracy over
all classes for all sampling methods under different sampling
rates, using the DT and iDT features, respectively. In
general, feature sampling can achieve higher performance than
directly using all features, since noisy background features
are discarded.

Specifically, for the DT feature, we can see that: 1)
Trajectory features sampled inside the annotated bounding
boxes achieve higher accuracy than using all features. A
similar phenomenon has been observed in prior work as well,
which indicates that DT features located around the human
body are more important than features extracted from other
regions. 2) Selective sampling methods achieve higher
accuracies than random sampling given a similar number of
sampled features. This shows that sampling DT features from
certain regions is important for action recognition, and that
object proposal based strategies are able to detect these
regions. 3) The proposed selective sampling via motion object
proposal outperforms the other sampling methods, and even
outperforms the one based on annotated bounding boxes. This
verifies that the proposed FusionEdgeBox method is useful for
finding regions of interest for action recognition.

Fig. 3. Average accuracies using the DT feature.

Fig. 4. Average accuracies using the iDT feature.

For the iDT feature, however, the different sampling methods
result in similar accuracies. Random sampling slightly
outperforms the others, especially when the number of sampled
features is small. The reason may be that, by eliminating
background motion caused by camera movement, the iDT feature
is more compact and meaningful than the DT feature; e.g., the
average number of iDT features per video is much lower than
that of DT features. Random sampling better preserves the
original iDT feature distribution than the selective sampling
methods, which have a quite large sampling bias.


Method                            J-HMDB     Memory (GB)

Dense Trajectory [2]              62.88%     5.4
Improved Dense Trajectory [3]     64.52%     4.2
Peng et al. [20] w/ iDT           69.03%*    4.2
Gkioxari et al.                   62.5%      -

Discard 20% ~ 25% features
DT    Random                      62.33%     4.3
DT    EdgeBox                     65.33%     4.5
DT    FusionEdgeBox               65.91%     4.0
iDT   Random                      65.49%     3.4
iDT   EdgeBox                     65.32%     3.6
iDT   FusionEdgeBox               65.11%     3.5

Discard 70% ~ 80% features
DT    Random                      59.90%     1.1
DT    EdgeBox                     58.51%     1.4
DT    FusionEdgeBox               60.71%     1.4
iDT   Random                      62.34%     1.3
iDT   EdgeBox                     58.85%     1.2
iDT   FusionEdgeBox               60.87%     1.3


Table 1. Comparison with the state of the art in terms of
average accuracy and feature size. * Uses an advanced feature
encoding technique, the stacked Fisher vector.


4.2. Comparison with the State of the Art

Table 1 compares the feature sampling methods at different
sampling rates with the state of the art. The sampling
methods achieve better average accuracies than several
state-of-the-art methods using the same classification
pipeline, with about 20% fewer features. It is interesting to
observe that, even when discarding more than 70% of the
features, random sampling and the proposed selective sampling
still retain comparable performance.
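
As a rough check of the savings, the memory figures reported
in Table 1 can be turned into relative reductions. The sketch
below treats memory as a proxy for the number of features,
which is an assumption; the helper name is hypothetical and
the input values are copied from the table.

```python
def relative_saving(full_gb, reduced_gb):
    """Fraction of memory saved relative to the full feature set."""
    return 1.0 - reduced_gb / full_gb

dt_full, idt_full = 5.4, 4.2  # memory of the full DT and iDT feature sets (GB)

# FusionEdgeBox on DT, at the two sampling levels of Table 1.
print(round(100 * relative_saving(dt_full, 4.0), 1))   # 25.9
print(round(100 * relative_saving(dt_full, 1.4), 1))   # 74.1
# Random sampling on iDT, at the two sampling levels of Table 1.
print(round(100 * relative_saving(idt_full, 3.4), 1))  # 19.0
print(round(100 * relative_saving(idt_full, 1.3), 1))  # 69.0
```

The resulting reductions are broadly in line with the
"discard 20% ~ 25%" and "discard 70% ~ 80%" groupings used in
the table.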


5. CONCLUSIONS 

In this work, we focus on feature sampling strategies for
action recognition in videos. Dense trajectory features are
used to represent videos. Two types of sampling strategies
are investigated, namely uniformly random sampling and
selective sampling. We propose to use object proposal
techniques to construct saliency maps for video frames, and
use them to guide the selective feature sampling process. We
also propose a motion object proposal method that
incorporates object motion information into the object
proposal framework. Experiments conducted on a large video
dataset indicate that sampling-based methods achieve better
recognition accuracy using 25% fewer features with one of the
proposed selective feature sampling methods, and even retain
comparable accuracy when discarding 70% of the features.